Abstract

In this paper, we give a new sharp generalization bound for $\ell_p$-MKL, which is a generalized framework of multiple kernel learning (MKL) that imposes $\ell_p$-mixed-norm regularization instead of $\ell_1$-mixed-norm regularization. We utilize localization techniques to obtain the sharp learning rate. The bound is characterized by the decay rate of the eigenvalues of the associated kernels. A larger decay rate gives a faster convergence rate. Furthermore, we give the minimax learning rate on the ball characterized by the $\ell_p$-mixed-norm in the product space. Then we show that our derived learning rate of $\ell_p$-MKL achieves the minimax optimal rate on the $\ell_p$-mixed-norm ball.

1 Introduction 

Multiple Kernel Learning (MKL), proposed by Lanckriet et al. (2004), is one of the most promising methods for adaptively selecting the kernel function in supervised kernel learning. Kernel methods are widely used, and several studies have supported their usefulness (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). However, the performance of kernel methods critically relies on the choice of the kernel function. Many methods have been proposed to deal with the issue of kernel selection. Ong et al. (2005) studied hyperkernels as a kernel of kernel functions. Argyriou et al. (2006) considered a DC programming approach to learn a mixture of kernels with continuous parameters (see also Argyriou et al. (2005)). Some studies tackled the problem of learning non-linear combinations of kernels, as in Bach (2009), Cortes et al. (2009a), and Varma and Babu (2009). Among them, learning a linear combination of finitely many candidate kernels with non-negative coefficients is the most basic, fundamental and commonly used approach. The seminal work on MKL by Lanckriet et al. (2004) considered learning a convex combination of candidate kernels. This work opened up the sequence of MKL studies. Bach et al. (2004) showed that MKL can be reformulated as a kernel version of the group lasso (Yuan and Lin, 2006). This formulation gives the insight that MKL can be described as an $\ell_1$-mixed-norm regularized learning method. As a generalization of MKL, $\ell_p$-MKL, which imposes $\ell_p$-mixed-norm regularization ($\sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^{p}$ with $p \geq 1$), has been proposed (Micchelli and Pontil, 2005; Kloft et al., 2009), where $\{\mathcal{H}_m\}_{m=1}^{M}$ are M reproducing kernel Hilbert spaces (RKHSs) and $f_m \in \mathcal{H}_m$. $\ell_p$-MKL includes the original MKL as the special case of $\ell_1$-MKL. One recent observation is that $\ell_p$-MKL with $p > 1$ shows better performance than $\ell_1$-MKL in several situations (Kloft et al., 2009; Cortes et al., 2009b). To justify the usefulness of $\ell_p$-MKL, a few papers have given theoretical analyses of $\ell_p$-MKL (Cortes et al., 2009b, 2010; Kloft et al., 2010a). In this paper we give a new, faster learning rate of $\ell_p$-MKL utilizing the localization techniques (van de Geer, 2000; Bartlett et al., 2005, 2006; Koltchinskii, 2006), and show that our learning rate is optimal in the sense of minimaxity. This is the first attempt to show a fast localized learning rate for $\ell_p$-MKL.

In the pioneering paper of Lanckriet et al. (2004), a convergence rate of MKL is given as $\sqrt{M/n}$, where M is the number of given kernels and n is the number of samples. Srebro and Ben-David (2006) gave an improved learning bound utilizing the pseudo-dimension of the given kernel class. Ying and Campbell (2009) gave a convergence bound utilizing Rademacher chaos and gave some upper bounds of the Rademacher chaos utilizing the pseudo-dimension of the kernel class. Cortes et al. (2009b) presented a convergence bound for a learning method with $L_2$ regularization on the kernel weight. Cortes et al. (2010) showed that the convergence rate of $\ell_1$-MKL is $\sqrt{\log(M)/n}$. They also gave the convergence rate of $\ell_p$-MKL as $M^{1-\frac{1}{p}}/\sqrt{n}$ for $p > 1$. Kloft et al. (2010a) gave a similar convergence bound with improved constants. Kloft et al. (2010b) generalized the bound to a variant of the elastic-net type regularization and widened the effective range of p to all $p \geq 1$, while in the existing bounds $1 \leq p \leq 2$ was imposed. Our concern about the existing bounds is that all bounds introduced above are "global" bounds in the sense that the bounds are applicable to all candidates of estimators. Consequently, all convergence rates presented above are of order $1/\sqrt{n}$ with respect to the number n of samples. However, by utilizing the localization techniques, including the so-called local Rademacher complexity (Bartlett et al., 2005, 2006; Koltchinskii, 2006) and the peeling device (van de Geer, 2000), we can derive a faster learning rate. Instead of uniformly bounding all candidates of estimators, the localized inequality focuses on a particular estimator such as the empirical risk minimizer, and thus can give a sharp convergence rate.

Localized bounds of MKL have been given mainly in sparse learning settings such as $\ell_1$-MKL or elastic-net type MKL (Shawe-Taylor, 2008; Tomioka and Suzuki, 2010). The first localized bound of MKL was derived by Koltchinskii and Yuan (2008) in the setting of $\ell_1$-MKL. The second one was given by Meier et al. (2009), who gave a near optimal convergence rate for elastic-net type regularization. Recently, Koltchinskii and Yuan (2010) considered a variant of $\ell_1$-MKL and showed that it achieves the minimax optimal convergence rate. All these localized convergence rates were considered in sparse learning settings. The localized fast learning rate of $\ell_p$-MKL has not been addressed.

In this paper, we give a sharp convergence rate of $\ell_p$-MKL utilizing the localization techniques. Our bound also clarifies the relation between the convergence rate and the tuning parameter p. The resultant convergence rate is $M^{1-\frac{2s}{p(1+s)}} n^{-\frac{1}{1+s}} R_p^{\frac{2s}{1+s}}$, where $R_p := \left(\sum_{m=1}^{M} \|f_m^*\|_{\mathcal{H}_m}^{p}\right)^{\frac{1}{p}}$ is determined by the true function $f^*$, and s ($0 < s < 1$) is a constant that represents the complexity of the RKHSs. The bound includes the bounds of Cortes et al. (2010) and Kloft et al. (2010a) as the special case $s \to 1$. Finally, we show that the bound for $\ell_p$-MKL achieves the minimax optimal rate on the ball with respect to the $\ell_p$-mixed-norm, $\{f = \sum_{m=1}^{M} f_m \mid (\sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^{p})^{\frac{1}{p}} \leq R\}$. This indicates that $\ell_p$-MKL is compatible with the $\ell_p$-mixed-norm.

2 Preliminary 

In this section we give the problem formulation, the notations and the assumptions for the convergence analysis of $\ell_p$-MKL.

2.1 Problem Formulation 

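Suppose that we are given n i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$ distributed from a probability distribution P on $\mathcal{X} \times \mathbb{R}$ that has the marginal distribution $\Pi$ on $\mathcal{X}$. We are given M reproducing kernel Hilbert spaces (RKHSs) $\{\mathcal{H}_m\}_{m=1}^{M}$, each of which is associated with a kernel $k_m$. $\ell_p$-MKL ($p \geq 1$) fits a function $f = \sum_{m=1}^{M} f_m$ ($f_m \in \mathcal{H}_m$) to the data by solving the following optimization problem¹:

$$\hat{f} = \sum_{m=1}^{M} \hat{f}_m = \mathop{\arg\min}_{f_m \in \mathcal{H}_m\ (m=1,\dots,M)} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{m=1}^{M} f_m(x_i) \right)^{2} + \lambda_1^{(n)} \left( \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^{p} \right)^{\frac{2}{p}}. \qquad (1)$$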

This is reduced to a finite dimensional optimization problem due to the representer theorem (Kimeldorf and Wahba, 1971). The problem is convex, and thus there are efficient algorithms to solve it, e.g., Kloft et al. (2009, 2010a) and Vishwanathan et al. (2010). In this paper, we focus on the regression problem (the squared loss). However, the discussion presented here can be generalized to Lipschitz continuous and strongly convex losses (Bartlett et al., 2006).
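To make the finite dimensional form concrete, the following minimal sketch evaluates the $\ell_p$-MKL objective for a given set of coefficients. It assumes the representer-theorem parametrization $f_m = \sum_j \alpha_{m,j} k_m(x_j, \cdot)$, so that $f_m(x_i) = (K_m \alpha_m)_i$ and $\|f_m\|_{\mathcal{H}_m}^2 = \alpha_m^\top K_m \alpha_m$. The function names (`lp_mkl_objective`, `quad_form`, `mat_vec`) are hypothetical and chosen only for this illustration; this is not the solver used in the cited works.

```python
import math


def quad_form(K, a):
    """Return a^T K a for a Gram matrix K (list of rows) and coefficients a."""
    n = len(a)
    return sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))


def mat_vec(K, a):
    """Return the matrix-vector product K a."""
    return [sum(K[i][j] * a[j] for j in range(len(a))) for i in range(len(K))]


def lp_mkl_objective(grams, alphas, y, lam, p):
    """Squared loss plus lam * (sum_m ||f_m||^p)^(2/p).

    grams[m] is the n x n Gram matrix of kernel k_m on the sample, and
    alphas[m] holds the expansion coefficients of f_m (an assumption of
    this sketch, following the representer theorem).
    """
    n = len(y)
    preds = [0.0] * n
    rkhs_norms = []
    for K, a in zip(grams, alphas):
        fm_values = mat_vec(K, a)
        preds = [u + v for u, v in zip(preds, fm_values)]
        # Clip tiny negative values caused by rounding before the square root.
        rkhs_norms.append(math.sqrt(max(quad_form(K, a), 0.0)))
    loss = sum((yi - fi) ** 2 for yi, fi in zip(y, preds)) / n
    mixed_norm = sum(r ** p for r in rkhs_norms) ** (1.0 / p)
    return loss + lam * mixed_norm ** 2
```

With $p = 1$ the penalty is the squared sum of the RKHS norms, and with $p = 2$ it is the sum of their squares, so the same function covers both the sparse and the non-sparse regularizer.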

Sometimes the regularization of $\ell_p$-MKL for $1 \leq p \leq 2$ is imposed in terms of the kernel weight as

$$\min_{f \in \mathcal{H}_{k_\theta}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^{2} + \tilde{\lambda}_1^{(n)} \|f\|_{\mathcal{H}_{k_\theta}}^{2} \quad \text{s.t.} \quad k_\theta = \sum_{m=1}^{M} \theta_m k_m,\ \sum_{m=1}^{M} \theta_m^{\frac{p}{2-p}} = 1,\ \theta_m \geq 0, \qquad (2)$$

where $\mathcal{H}_{k_\theta}$ is the RKHS corresponding to the kernel $k_\theta$. However, these two formulations are completely the same, that is, we obtain the same resultant solution in both formulations (see Lemma 25 of Micchelli and Pontil (2005) and Tomioka and Suzuki (2010) for details). Moreover, our formulation (1) also covers the situation of $p > 2$, while the kernel weight constraint formulation is restricted to $1 \leq p \leq 2$.

¹ One might like to use $\sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^{p}$ instead of $(\sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^{p})^{\frac{2}{p}}$ as regularization. However, this difference does not matter because, by adjusting the regularization parameter $\lambda_1^{(n)}$, there is a one-to-one correspondence between the solutions of both regularization types.

Table 1: Summary of the constants we use in this article.

n           The number of samples.
M           The number of candidate kernels.
s           The spectral decay coefficient; see (A3).
$\kappa_M$  The smallest eigenvalue of the design matrix (see Eq. (6)).
$R_p$       The $\ell_p$-mixed-norm of the truth: $(\sum_{m=1}^{M} \|f_m^*\|_{\mathcal{H}_m}^{p})^{\frac{1}{p}}$.

2.2 Notations and Assumptions 

Here, we prepare notations and conditions that are used in the analysis.

Let $\mathcal{H}^{\oplus M} = \mathcal{H}_1 \oplus \dots \oplus \mathcal{H}_M$. Throughout the paper, we assume the following technical conditions (see also Bach (2008)).

Assumption 1 (Basic Assumptions)

(A1) There exists $f^* = (f_1^*, \dots, f_M^*) \in \mathcal{H}^{\oplus M}$ such that $E[Y|X] = f^*(X) = \sum_{m=1}^{M} f_m^*(X)$, and the noise $\epsilon := Y - f^*(X)$ is bounded as $|\epsilon| \leq L$.

(A2) For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable (with respect to the RKHS norm) and $\sup_{X \in \mathcal{X}} |k_m(X, X)| \leq 1$.

The first assumption in (A1) ensures that the model $\mathcal{H}^{\oplus M}$ is correctly specified, and the technical assumption $|\epsilon| \leq L$ allows $\epsilon f$ to be Lipschitz continuous with respect to f. The noise boundedness can be relaxed to the unbounded situation as in Raskutti et al. (2010), but we do not pursue that direction for simplicity.

Due to Mercer's theorem, there are an orthonormal system $\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and the spectrum $\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the following spectral representation:

$$k_m(x, x') = \sum_{k=1}^{\infty} \mu_{k,m} \phi_{k,m}(x) \phi_{k,m}(x'). \qquad (3)$$

By this spectral representation, the inner product of the RKHS can be expressed as $\langle f_m, g_m \rangle_{\mathcal{H}_m} = \sum_{k=1}^{\infty} \mu_{k,m}^{-1} \langle f_m, \phi_{k,m} \rangle_{L_2(\Pi)} \langle \phi_{k,m}, g_m \rangle_{L_2(\Pi)}$ for $f_m, g_m \in \mathcal{H}_m$.

Constants we use later are summarized in Table 1.

Assumption 2 (Spectral Assumption) There exist $0 < s < 1$ and $0 < c$ such that

(A3) $\mu_{k,m} \leq c k^{-\frac{1}{s}}$, ($1 \leq \forall k$, $1 \leq \forall m \leq M$),

where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (3)).

It was shown that the spectral assumption (A3) is equivalent to the classical covering number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number $N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover the unit ball $B_{\mathcal{H}_m}$ in $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A3) holds, there exists a constant C that depends only on s and c such that

$$\log N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \leq C \epsilon^{-2s}, \qquad (4)$$

and the converse is also true (see Steinwart et al. (2009, Theorem 15) and Steinwart (2008) for details). Therefore, if s is large, the RKHSs are regarded as "complex", and if s is small, the RKHSs are "simple".

Associated with the $\epsilon$-covering number, the i-th entropy number $e_i(\mathcal{H}_m \to L_2(\Pi))$ is defined as the infimum over all $\epsilon > 0$ for which $N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \leq 2^{i-1}$. If the spectral assumption (A3) holds, the relation (4) implies that the i-th entropy number is bounded as

$$e_i(\mathcal{H}_m \to L_2(\Pi)) \leq C i^{-\frac{1}{2s}}, \qquad (5)$$

where C is a constant. To bound the empirical process, a bound of the entropy number with respect to the empirical distribution is needed. The following proposition gives an upper bound of that (see Corollary 7.31 of Steinwart (2008), for example).
Proposition 1 If there exist constants $0 < s < 1$ and $C \geq 1$ such that $e_i(\mathcal{H}_m \to L_2(\Pi)) \leq C i^{-\frac{1}{2s}}$, then there exists a constant $c_s > 0$ only depending on s such that

$$E_{D_n \sim \Pi^n}[e_i(\mathcal{H}_m \to L_2(D_n))] \leq c_s C (\min(i, n))^{\frac{1}{2s}} i^{-\frac{1}{s}},$$

in particular $E_{D_n \sim \Pi^n}[e_i(\mathcal{H}_m \to L_2(D_n))] \leq c_s C i^{-\frac{1}{2s}}$.

An important class of RKHSs where s is known is the Sobolev space. (A3) holds with $s = \frac{d}{2w}$ for the Sobolev space of w-times continuous differentiability on the Euclidean ball of $\mathbb{R}^d$ (van der Vaart and Wellner, 1996, Theorem 2.7.1). Moreover, for m-times differentiable kernels on a closed Euclidean ball in $\mathbb{R}^d$, it holds for $s = \frac{d}{2m}$ (Steinwart, 2008, Theorem 6.26). According to Zhou (2002), for Gaussian kernels with compact support, it holds for arbitrarily small $0 < s$. The entropy number of Gaussian kernels with unbounded support is described in Theorem 7.34 of Steinwart (2008).
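As a small worked illustration of how the smoothness enters, the sketch below maps a dimension d and a smoothness w to the spectral coefficient, assuming the relation $s = d/(2w)$ for the Sobolev case, and to the exponent $1/(1+s)$ of the single-kernel rate $n^{-1/(1+s)}$. The function names are hypothetical and used only here. Under this assumption the exponent equals $2w/(2w+d)$, the familiar nonparametric rate exponent.

```python
import math


def spectral_coefficient(d, w):
    """Assumed relation s = d / (2w) for w-times differentiable functions on R^d."""
    s = d / (2.0 * w)
    if not 0.0 < s < 1.0:
        raise ValueError("the spectral assumption needs 0 < s < 1, i.e. w > d / 2")
    return s


def rate_exponent(s):
    """Exponent a in the single-kernel rate n^(-a), with a = 1 / (1 + s)."""
    return 1.0 / (1.0 + s)


for d, w in [(1, 1), (1, 2), (2, 2), (4, 3)]:
    s = spectral_coefficient(d, w)
    # Both columns agree: 1 / (1 + s) = 2w / (2w + d).
    print(d, w, s, rate_exponent(s), 2.0 * w / (2.0 * w + d))
```

Smoother spaces (larger w at fixed d) give a smaller s and hence an exponent closer to 1.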

Let $\kappa_M$ be defined as follows:

$$\kappa_M := \sup \left\{ \kappa \geq 0 \ \middle|\ \kappa \leq \frac{\left\| \sum_{m=1}^{M} f_m \right\|_{L_2(\Pi)}^{2}}{\sum_{m=1}^{M} \|f_m\|_{L_2(\Pi)}^{2}},\ \forall f_m \in \mathcal{H}_m\ (m = 1, \dots, M) \right\}. \qquad (6)$$

$\kappa_M$ represents the correlation of the RKHSs. We assume that the RKHSs are not completely correlated to each other.

Assumption 3 (Incoherence Assumption) $\kappa_M$ is strictly bounded from below; there exists a constant $C_0 > 0$ such that

(A4) $0 < C_0^{-1} < \kappa_M$.

This condition is motivated by the incoherence condition (Koltchinskii and Yuan, 2008; Meier et al., 2009) considered in sparse MKL settings. This ensures the uniqueness of the decomposition $f^* = \sum_{m=1}^{M} f_m^*$ of the ground truth. Bach (2008) also assumed this condition to show the consistency of $\ell_1$-MKL.
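The ratio in the definition of $\kappa_M$ can be illustrated in a deliberately simplified setting. The sketch below assumes M = 2 and restricts each component to a single fixed direction, $f_1 = a g_1$ and $f_2 = b g_2$, where $g_1, g_2$ have unit $L_2(\Pi)$-norm and correlation $\rho$. These restrictions, and the function names, are assumptions of the illustration, not part of the definition above, which ranges over all functions in the RKHSs. In this reduced setting the infimum of the ratio is the smallest eigenvalue of the 2 x 2 matrix with unit diagonal and off-diagonal $\rho$, namely $1 - |\rho|$.

```python
import math


def ratio(a, b, rho):
    """||a g1 + b g2||^2 / (||a g1||^2 + ||b g2||^2) for unit-norm g1, g2 with <g1, g2> = rho."""
    return (a * a + b * b + 2.0 * a * b * rho) / (a * a + b * b)


def kappa_on_grid(rho, steps=3600):
    """Minimize the ratio over directions (a, b) = (cos t, sin t) on a grid of angles."""
    angles = (2.0 * math.pi * k / steps for k in range(steps))
    return min(ratio(math.cos(t), math.sin(t), rho) for t in angles)


for rho in (0.0, 0.3, -0.8, 1.0):
    # The grid minimum matches the closed form 1 - |rho|.
    print(rho, kappa_on_grid(rho), 1.0 - abs(rho))
```

Orthogonal components ($\rho = 0$) give the best possible value 1, while perfectly correlated components ($|\rho| = 1$) give 0, which is the situation excluded by (A4).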

Finally we give a technical assumption with respect to the $\infty$-norm.

Assumption 4 (Embedded Assumption) Under the Spectral Assumption, there exists a constant $C_1 > 0$ such that

(A5) $\|f_m\|_{\infty} \leq C_1 \|f_m\|_{L_2(\Pi)}^{1-s} \|f_m\|_{\mathcal{H}_m}^{s}$.

This condition is met when the RKHSs are continuously embedded in a Besov space $B_{2,1}^{m}(\mathcal{X})$ where $s = \frac{d}{2m}$, d is the dimension of the input space $\mathcal{X}$ and m is the smoothness of the Besov space. For example, the RKHSs of Gaussian kernels can be embedded in all Sobolev spaces, and therefore the condition (A5) seems rather common and practical. More generally, there is a clear characterization of the condition (A5) in terms of real interpolation of spaces. One can find detailed and formal discussions of interpolations in Steinwart et al. (2009), and Proposition 2.10 of Bennett and Sharpley (1988) gives the necessary and sufficient condition for the condition (A5).

3 Convergence Rate of $\ell_p$-MKL

Here we derive the convergence rate of the estimator $\hat{f}$. We suppose that the number of kernels M can increase along with the number of samples n. The motivation of our analysis is summarized as follows:

• Deriving a sharp convergence rate utilizing localization techniques.

• Clarifying the relation between the norm $(\sum_{m=1}^{M} \|f_m^*\|_{\mathcal{H}_m}^{p})^{\frac{1}{p}}$ of the truth and the generalization bound.

Now we define

$$\eta(t) := \eta_n(t) = \max(1, \sqrt{t}, t/\sqrt{n}),$$

and for a given positive real $\lambda$ we define


, /Mlog(M) \-iM~ p M 2 (!+=) p(i+»)A 
" 2 h/ n V V ^ 1 ' ' ' j 

Then we obtain the following convergence rate. 
Theorem 2 Suppose $\lambda_1^{(n)} > 0$ and let $\lambda$ in the definition (7) of $\zeta_n$ be $\lambda = \lambda_1^{(n)}$. Then there exists a constant $\psi_s$ depending on L, s, c, $C_1$ such that for all n and $t'$ ($> 0$) that satisfy $\frac{\log(M)}{\sqrt{n}} \leq 1$ and

^^v(t') < 1, 

the solution of $\ell_p$-MKL given in Eq. (1) for arbitrary real $p \geq 1$ satisfies

11/ - riiLcn) ^ ^c 2 ^w 2 + f f; iiofj) ' , (») 



K ^ 3 

with probability $1 - \exp(-t) - \exp(-t')$ for all $t \geq 1$.

The proof will be given in Appendix A.

Let $R_p := \left( \sum_{m=1}^{M} \|f_m^*\|_{\mathcal{H}_m}^{p} \right)^{\frac{1}{p}}$. Suppose that n is sufficiently large compared with M and $R_p$

(n > M? R~ 2 (log M)^~ and n > (Rp/Mp) 1 ^ 7 ). Then the regularization parameter A^ that 
achieves the minimum of the RHS of the bound (jSJ is given by 

$$\lambda_1^{(n)} = n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{-\frac{2}{1+s}}$$
up to constant. Then the convergence rate of $\|\hat{f} - f^*\|_{L_2(\Pi)}^{2}$ becomes

11/ - /li 2(n) =O p (n-^M^Rp + + n-rh-^ M i-^Rp^). ( 9 ) 

Under the condition n > MpR~ 2 (logM)^~ V (Rp/Mp) 1 ^ 7 , the leading term is the first term, and 
thus we have 

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^{2} = O_p\left( n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}} \right). \qquad (10)$$

Note that as the complexity s of the RKHSs becomes small, the convergence rate becomes fast. It is known that $n^{-\frac{1}{1+s}}$ is the minimax optimal learning rate for single kernel learning. The derived rate of $\ell_p$-MKL is obtained by multiplying the optimal rate of single kernel learning by a coefficient depending on M and $R_p$. To investigate the dependency of the learning rate on $R_p$, let us consider two extreme settings, i.e., the sparse setting $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^{M} = (1, 0, \dots, 0)$ and the dense setting $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^{M} = (1, \dots, 1)$, as in Kloft et al. (2010a).

• $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^{M} = (1, 0, \dots, 0)$: $R_p = 1$ for all p. Therefore the convergence rate $n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}}$ is fast for small p, and the minimum is achieved at $p = 1$. This means that $\ell_1$ regularization is preferred for a sparse truth.

• $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^{M} = (1, \dots, 1)$: $R_p = M^{\frac{1}{p}}$, thus the convergence rate is $M n^{-\frac{1}{1+s}}$ for all p. Interestingly, for a dense ground truth, there is no dependency of the convergence rate on the parameter p. That is, the convergence rate is M times the optimal learning rate of single kernel learning ($n^{-\frac{1}{1+s}}$) for all p. This means that for the dense setting, the complexity of solving the MKL problem is equivalent to that of solving M single kernel learning problems.
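The two extreme settings can be checked numerically. The following sketch evaluates the rate $n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}}$ with all constants dropped; the function names and the particular values of n, M and s are assumptions made only for this illustration.

```python
import math


def mixed_norm(norms, p):
    """l_p-mixed-norm R_p = (sum_m r_m^p)^(1/p) of a vector of RKHS norms."""
    return sum(r ** p for r in norms) ** (1.0 / p)


def localized_rate(n, M, s, p, R_p):
    """n^(-1/(1+s)) * M^(1 - 2s/(p(1+s))) * R_p^(2s/(1+s)), constants dropped."""
    return (n ** (-1.0 / (1.0 + s))
            * M ** (1.0 - 2.0 * s / (p * (1.0 + s)))
            * R_p ** (2.0 * s / (1.0 + s)))


n, M, s = 10_000, 16, 0.5
for p in (1.0, 1.5, 2.0, 4.0):
    sparse = localized_rate(n, M, s, p, mixed_norm([1.0] + [0.0] * (M - 1), p))
    dense = localized_rate(n, M, s, p, mixed_norm([1.0] * M, p))
    # sparse grows with p; dense stays at M * n^(-1/(1+s)) for every p.
    print(p, sparse, dense)
```

In the dense case the factor $M^{-\frac{2s}{p(1+s)}}$ is cancelled exactly by $R_p^{\frac{2s}{1+s}} = M^{\frac{2s}{p(1+s)}}$, which is why p drops out.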

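3.1 Comparison with existing bounds

Here we compare the bound we derived with the existing bounds. Let $\mathcal{H}_{\ell_p}(R)$ be the $\ell_p$-mixed-norm ball with radius R defined as follows:

$$\mathcal{H}_{\ell_p}(R) := \left\{ f = \sum_{m=1}^{M} f_m \ \middle|\ \left( \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}^{p} \right)^{\frac{1}{p}} \leq R \right\}.$$

The bounds by Cortes et al. (2010) and Kloft et al. (2010b,a) are most relevant to our results. Roughly speaking, their bounds are given as

$$R(f) \leq \hat{R}(f) + C \frac{M^{1-\frac{1}{p}} \sqrt{\log(M)}}{\sqrt{n}} R \quad \text{for all } f \in \mathcal{H}_{\ell_p}(R), \qquad (11)$$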
where $R(f)$ and $\hat{R}(f)$ are the population risk and the empirical risk, respectively. The first observation is that the bounds by Cortes et al. (2010) and Kloft et al. (2010a) are restricted to the situation $1 \leq p \leq 2$ because their analysis is based on the kernel weight constraint formulation (2). On the other hand, our analysis and that of Kloft et al. (2010b) cover all $p \geq 1$. Second, since our bound is specialized to the regularized risk minimizer $\hat{f}$ defined in Eq. (1) while the existing bound (11) is applicable to all
/ 6 He (R), our bound is sharper than theirs. To see this, suppose that 1 < p < 2 and M 1_ 5 < 1 
(which means the bound (fTTj) makes sense), then we have n~~ M 1- p< T T 7 T < n^M 1- ?. For the 

situation of p > 2, we have also n i+« M pu+ s ) < n 2 M p for n > M p . Moreover we should 
note that s can be large as long as the Spectral Assumption (A3) is satisfied. Thus the bound (11) is recovered by our analysis by letting s approach 1.
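The relation between the two kinds of rates can be explored numerically. The sketch below is an illustration under stated assumptions: it takes the global rate to have the form $n^{-\frac{1}{2}} M^{1-\frac{1}{p}}$ (the logarithmic factor and the radius are dropped) and the localized rate to have the form $n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}}$ with $R_p = 1$. Under these forms the quotient of the two equals $(M^{\frac{1}{p}}/\sqrt{n})^{\frac{1-s}{1+s}}$, so in this sketch the localized rate is no larger than the global one whenever $n \geq M^{\frac{2}{p}}$, and the two coincide in the limit $s \to 1$. The function names are hypothetical.

```python
import math


def global_rate(n, M, p):
    """Assumed global form n^(-1/2) * M^(1 - 1/p), log factor and radius dropped."""
    return n ** -0.5 * M ** (1.0 - 1.0 / p)


def localized_rate_unit_radius(n, M, s, p):
    """Assumed localized form n^(-1/(1+s)) * M^(1 - 2s/(p(1+s))) with R_p = 1."""
    return n ** (-1.0 / (1.0 + s)) * M ** (1.0 - 2.0 * s / (p * (1.0 + s)))


def quotient(n, M, s, p):
    """Closed form of localized / global: (M^(1/p) / sqrt(n))^((1-s)/(1+s))."""
    return (M ** (1.0 / p) / math.sqrt(n)) ** ((1.0 - s) / (1.0 + s))


M, p = 16, 2.0
for n in (16, 1_000, 100_000):
    for s in (0.2, 0.5, 0.999):
        loc = localized_rate_unit_radius(n, M, s, p)
        glo = global_rate(n, M, p)
        print(n, s, loc, glo, loc / glo, quotient(n, M, s, p))
```

The gain from localization grows with n and is largest for small s, that is, for simple RKHSs.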

The results by Koltchinskii and Yuan (2008), Meier et al. (2009), and Koltchinskii and Yuan (2010) are also related to ours in terms of the proof techniques. Their analyses and ours utilize localization techniques to obtain fast localized learning rates, in contrast to the global bounds of Cortes et al. (2010) and Kloft et al. (2010b,a). However, all those localized bounds are considered in sparse learning settings such as $\ell_1$ and elastic-net regularizations. Hence their frameworks are rather different from ours.

4 Lower bound of learning rate

In this section, we show that the derived learning rate achieves the minimax learning rate on $\mathcal{H}_{\ell_p}(R)$.

We derive the minimax learning rate in a simpler situation. First we assume that all the RKHSs are the same. That is, the input vector is decomposed into M components as $x = (x^{(1)}, \dots, x^{(M)})$, where $\{x^{(m)}\}_{m=1}^{M}$ are M i.i.d. copies of a random variable $\tilde{X}$, and $\mathcal{H}_m = \{f_m \mid f_m(x) = f_m(x^{(1)}, \dots, x^{(M)}) = \tilde{f}_m(x^{(m)}),\ \tilde{f}_m \in \tilde{\mathcal{H}}\}$, where $\tilde{\mathcal{H}}$ is an RKHS shared by all $\mathcal{H}_m$. Thus $f \in \mathcal{H}^{\oplus M}$ is decomposed as $f(x) = f(x^{(1)}, \dots, x^{(M)}) = \sum_{m=1}^{M} \tilde{f}_m(x^{(m)})$, where each $\tilde{f}_m$ is a member of the common RKHS $\tilde{\mathcal{H}}$. We denote by $\tilde{k}$ the kernel associated with the RKHS $\tilde{\mathcal{H}}$.

In addition to the condition about the upper bound of the spectrum (Spectral Assumption (A3)), we assume that the spectra of all the RKHSs $\mathcal{H}_m$ have the same lower bound of polynomial rate.

Assumption 5 (Strong Spectral Assumption) There exist $0 < s < 1$ and $0 < c, c'$ such that

(A6) $c' k^{-\frac{1}{s}} \leq \tilde{\mu}_k \leq c k^{-\frac{1}{s}}$, ($1 \leq \forall k$),

where $\{\tilde{\mu}_k\}_k$ is the spectrum of the kernel $\tilde{k}$. In particular, the spectra of the kernels $k_m$ also satisfy $\mu_{k,m} \sim k^{-\frac{1}{s}}$ ($\forall k, m$).

As discussed just after Assumption 2, this means that the covering number of $\tilde{\mathcal{H}}$ satisfies

$$\log N(\epsilon, B_{\tilde{\mathcal{H}}}, L_2(\Pi)) \sim \epsilon^{-2s},$$

where $B_{\tilde{\mathcal{H}}}$ is the unit ball of $\tilde{\mathcal{H}}$ (see Steinwart et al. (2009, Theorem 15) and Steinwart (2008) for details). Without loss of generality, we may assume that

$$E[f(\tilde{X})] = 0 \quad (\forall f \in \tilde{\mathcal{H}}).$$

Since each $f_m$ receives an i.i.d. copy of $\tilde{X}$, the $\mathcal{H}_m$ are orthogonal to each other:

$$E[f_m(X) f_{m'}(X)] = E[\tilde{f}_m(X^{(m)}) \tilde{f}_{m'}(X^{(m')})] = 0 \quad (\forall f_m \in \mathcal{H}_m,\ \forall f_{m'} \in \mathcal{H}_{m'},\ 1 \leq \forall m \neq m' \leq M).$$

We also assume that the noise $\{\epsilon_i\}_{i=1}^{n}$ is an i.i.d. normal sequence with standard deviation $\sigma > 0$. Under the assumptions described above, we have the following minimax $L_2(\Pi)$-error.

Theorem 3 For a given $0 < R$, the minimax learning rate on $\mathcal{H}_{\ell_p}(R)$ is lower bounded as

$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} E\left[ \|\hat{f} - f^*\|_{L_2(\Pi)}^{2} \right] \geq C n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R^{\frac{2s}{1+s}},$$

where the minimum is taken over all measurable functions of the n samples $\{(x_i, y_i)\}_{i=1}^{n}$.

The proof will be given in Appendix B. One can see that the convergence rate derived in Theorem 2 and Eq. (10) achieves the lower bound of Theorem 3. Thus our bound is tight. Interestingly, the learning rate (10) of $\ell_p$-MKL and the minimax learning rate on the $\ell_p$-mixed-norm ball coincide at the common p. This means that the $\ell_p$-mixed-norm regularization is well suited to make the estimator included in the $\ell_p$-mixed-norm ball.
5 Conclusion and Discussion

We have shown a sharp optimal learning rate of $\ell_p$-MKL by utilizing the localization techniques. Our bound is sharper than existing bounds and achieves the minimax learning rate under the Spectral Assumption (A3).

Important future work remains. The bound given in Eq. (10) becomes smaller as p becomes smaller, since $R_p / M^{\frac{1}{p}}$ decreases as $p \searrow 1$. That is, according to the theoretical result, $\ell_p$-MKL shows the best performance at $p = 1$, despite the disappointing results for $p = 1$ reported by some numerical experiments. This concern was also pointed out by Cortes et al. (2010). It is important future work to clarify theoretically why $\ell_p$-MKL with $p > 1$ works well in some real
situations. The second interesting future work is about the Mlo s( M ) term appeared in the bound 
Eq. ©. Because of this term, our bound is (3(Mlog(M)) with respect to M while in the existing 

work that is 0{M 1 ^^ ). It is an interesting issue to clarify whether the term — lo ^( M ) can be replaced 
by other tighter bounds or not. To do so, it might be useful to precisely estimate the covering number 
of H ip (R). 
A Proof of Theorem 2

Before we show Theorem 2 we prepare several lemmas. The following two propositions are key for localization. Let $\{\sigma_i\}_{i=1}^{n}$ be i.i.d. Rademacher random variables, i.e., $\sigma_i \in \{\pm 1\}$ and $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$.

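Proposition 4 (Steinwart, 2008, Theorem 7.16) Let $B_{\sigma,a,b} \subset \mathcal{H}_m$ be a set such that $B_{\sigma,a,b} = \{f_m \in \mathcal{H}_m \mid \|f_m\|_{L_2(\Pi)} \leq \sigma,\ \|f_m\|_{\mathcal{H}_m} \leq a,\ \|f_m\|_{\infty} \leq b\}$. Assume that there exist constants $0 < s < 1$ and $0 < c_s$ such that

$$E_{D_n}[e_i(\mathcal{H}_m \to L_2(D_n))] \leq c_s i^{-\frac{1}{2s}}.$$

Then there exists a constant $C'_s$ depending only on s such that

$$E\left[ \sup_{f_m \in B_{\sigma,a,b}} \left| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f_m(x_i) \right| \right] \leq C'_s \left( \frac{\sigma^{1-s} (c_s a)^{s}}{\sqrt{n}} \vee (c_s a)^{\frac{2s}{1+s}} b^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} \right). \qquad (12)$$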



Proposition 5 (Talagrand's Concentration Inequality (Talagrand, 1996; Bousquet, 2002)) Let $\mathcal{G}$ be a function class on $\mathcal{X}$ that is separable with respect to the $\infty$-norm, and $\{x_i\}_{i=1}^{n}$ be i.i.d. random variables with values in $\mathcal{X}$. Furthermore, let $B \geq 0$ and $U \geq 0$ be $B := \sup_{g \in \mathcal{G}} E[(g - E[g])^{2}]$ and $U := \sup_{g \in \mathcal{G}} \|g\|_{\infty}$. Then there exists a universal constant K such that, for $Z := \sup_{g \in \mathcal{G}} \left| \frac{1}{n} \sum_{i=1}^{n} g(x_i) - E[g] \right|$, we have

$$P\left( Z \geq K \left[ E[Z] + \sqrt{\frac{Bt}{n}} + \frac{Ut}{n} \right] \right) \leq e^{-t}.$$



Let $\lambda > 0$ be an arbitrary positive real. We determine $U_{n,s}(f_m)$ as follows:



/Mlog(Af) A-fM^ i+s(1 "f ) A"w^)JW^ 



(3-s) l + 4 a - 3 ^ s(3-s) 

(1+S) P (l + B) 



||/m|U 2 (n) A2||/ m || Wri 



>M 



It is easy to see U n , s {fm) is an upper bound of the quantity — L ^ u j_ _ Hm \j — 



(this corresponds to the RHS of Eq. (fT2)) ) because 

ll/ m |ll; ( n)ll/-ll^ _ A-fM 1 ?^ 1 -?) f\\f m \\L 2 (n)\^ s f ^\\U\n„ 



'M 
<



X-Ul 1 ^^ 1 -^ (\\U\L 2{ n } , A*||/ m || w „ 



(13) 



where we used Young's inequality in the last line, and similarly we obtain 

(l-°) 2 s(3-s 



n l + s 



n T+s 



AI 



Using Propositions [5] and SI we obtain the following ratio type uniform bound. 

Lemma 6 Under the Spectral Assumption (Assumption 2) and the Embedded Assumption (Assumption 4), there exists a constant $C_s$ depending only on s, c and $C_1$ such that



E 



sup 

f m £U m :\\f m \\n m =l 



Un,s (/m) 



<c s . 



(14) 



Proof: [Proof of Lemma [6] Let H m (a) := {f m E H m I ||/m||« m = 1> ||/m|U 2 (n) < °"} an d z = 
2 X I S > 1. Define r := A^M 5_ 5. Then by combining Propositions [T] and [4] with Assumption |4l we 
have 



E 



<E 



sup 



f m en m :\\f m \\u m =i U n ^ s (f m ) 



sup 



f m eH m (r) Un yS (f m ) 



k=l 



I n Si=l a ifm{Xi 

sup 77 F7 1 — 



l-s (l-s) 2 2s 



<cl- 



^ a/ 1 - i 



— r 

, 1 + s 



M p 



fe=i 



2 fc (i-s)T 1 - s c; 



s/n VM 



ziz 

ni+f 

°(3-s) (l-s) 2 , ,. l, s(3~s) 



fe=i 



z- 1 ° z i+ 

1 + 7 " V 



1 - Z i+« 



Thus by setting, C s = C'Ac s s \/ cl +s 1 + ^prj V 



s(3-s) / ! 
1-2 l+« 



we obtain the assertion. 



This lemma immediately gives the following corollary. 



Corollary 7 Under the Spectral Assumption (Assumption 2) and the Embedded Assumption (Assumption 4), there exists a constant $C_s$ depending only on s and C such that



E 



sup 

fmGHr, 



2i=l °~ifm{Xi)\ 
Un,s (/m) 



<c s . 



Lemma 8 If < 1, then under the Spectral Assumption (Assumption^ and the Embedded 

Assumption (Assumption^ there exists a constant C s depending only on s, c, C\ such that 



E 



max sup 

m f m en r . 



<c„. 



8 



UnAM 



Proof: [Proof of Lemma |5] First notice that the L2(n)-norm and the oo-norm of f'^ m /f '| can be 
evaluated by 

Pifm{xi) II/" 



Un,s (/m) 
0~ i fm {x i ) 



i ij(J Q < I / log(M) v A-fMT +s "-?' 



Un,s (/m) 



L 2 (II) U, l:S (f m ) \ V n 

_ ||/m||oo < ^lll/"»lli a (n)ll/"ill 



< 



71 



log(M) ! 



UfL^s infill) 



< Civ 7 ^. 



(15) 
(16) 



The last inequality of Eq. (fT6"| can be shown by using the relation (fT5]) . Thus Talagrand's inequality 
implies 



r, ( \ n 12i=l a 'ifm(xi)\ 

P max sup — — - — -— > K 



M / 

< V PI sup 

m=1 

<Me~ t . 



I n Si=l a ifm{3 



> K 



d + 



t 



dt 



iog(M) 



i t c x t 

log(M) + 7^ 



By setting f i + log(M), we obtain 

p / I^SLl^/m(^)l 

/ max sup 



> K 



C„ 



h + \og(M) Ci(t + log(M)) 
log(M) ^ 



for all i > 0. Consequently the expectation of the max-sup term can be bounded as 

\iYTi=l^.fra{x t )\ 



< e" 



E 



<K 



<2K 



max sup 



d + 1 + 

C, + 2 - 



Ci log(M) 



/>OC 




+ K 


[c, + ] j 


Jo 





'< + l + log(M) Ci(i + l + log(Af)) 



41og(M) 



2d 



Ci log(M) 



log(M) 
<C„ 



y/n 



e _t d* 



where we used + 1 + log(M) < Vt + ^1 + log(M) and J°° y/te^dt 



W log(M) 



< 1, and 



C s = 2K[C s + 2- 



3d]. 



Lemma 9 If $\frac{\log(M)}{\sqrt{n}} \leq 1$, then under the Spectral Assumption (Assumption 2) and the Embedded Assumption (Assumption 4), the following holds

/ i i Y^ ™ 



„ I nEi=l £ i/ra( I V j r 

P max sup — ; ; > KL 



2C S + V~t + ^ 



<e- t . 



Proof: [Proof of Lemma 9] By the contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12) and Lemma 8, we have



E 



max sup 



n X/i=l e ifm(Xj)\ 
Un,s {fm ) 



< 2E 



max sup 



< 2Ld 



Using this together with Eq. (15) and Eq. (16), Talagrand's inequality gives the assertion.
Theorem 10 Let <j>' s = K[2dd + d + C?]. Then, if ]2 ^ 1 < I, we have for allt>0 



• M 



L 2 (n) 



\m=l / 



for all f m € "H m (m = 1, . . . , M) with probability 1 — exp(— t). 
Proof: [Proof of Theorem 10]



E 



<2E 



sup 

f m eU T . 



sup 



EM r 
m=l J" 



£ 2 (n) 



n cr i(Sm=l fm( x i)) 



(X/m=l U n ,s(fm) 



< 



sup 



EM ,. 
m=l J" 



E™=1^(/™) 



x 2E 



sup 

/ m 6« T 



n Si=l '''(Em;! fm(Xi)) 



Em=l U n , s (f m ) 



(17) 



where we used the contraction inequality in the last line (Ledoux and Talagrand, 1991, Theorem 4.12). Thus using Eq. (16), the RHS of the inequality (17) can be bounded as



2Ci V / ^E 



<2Ci V / nE 



sup 



sup max 

/ m e« m m 



5Em=l U n ,s(fm) 
I n 5Ei=l @ifm(%i) \ 



Un,s (/m) 

where we used the relation $\frac{\sum_m a_m}{\sum_m b_m} \leq \max_m \left( \frac{a_m}{b_m} \right)$ for all $a_m \geq 0$ and $b_m \geq 0$ with the convention $\frac{0}{0} = 0$.
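The relation follows from $\sum_m a_m = \sum_m b_m \cdot \frac{a_m}{b_m} \leq \max_m \left( \frac{a_m}{b_m} \right) \sum_m b_m$. As a sanity check, the following sketch tests it on random non-negative inputs, including zero entries. It assumes, in addition to the convention $\frac{0}{0} = 0$ stated above, that $a/0$ is read as $+\infty$ for $a > 0$; the function names are hypothetical.

```python
import math
import random


def ratio(a, b):
    """a / b with the conventions 0/0 = 0 and a/0 = +inf for a > 0 (the latter is an assumption)."""
    if b == 0:
        return 0.0 if a == 0 else math.inf
    return a / b


def sum_ratio_is_bounded(a, b):
    """Check sum(a) / sum(b) <= max_m a_m / b_m for non-negative sequences a and b."""
    lhs = ratio(sum(a), sum(b))
    rhs = max(ratio(x, y) for x, y in zip(a, b))
    # Small relative slack absorbs floating-point rounding in the sums.
    return lhs <= rhs * (1.0 + 1e-12) + 1e-15


random.seed(0)
for _ in range(1000):
    size = random.randint(1, 6)
    a = [random.choice([0.0, random.random()]) for _ in range(size)]
    b = [random.choice([0.0, random.random()]) for _ in range(size)]
    assert sum_ratio_is_bounded(a, b)
```

This is the step that lets a supremum over sums of functions be replaced by a maximum over the individual kernels.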

By Lemma [3 the right hand side is upper bounded by 2Ciy/nC s . Here we again apply Talagrand's 
concentration inequality, then we have 



P 



sup 



V M f 



\ 



\ 



Em=l U 7ljS (f m ) 



> K 



2Ci<5 s Vn + VtnCi + Cft 



< e 



-t 



(18) 



where we substituted the following upper bounds of B and U 

\ 2 

(Era=l /ra) 2 



7 



B < sup E 



< sup E 

f m en m 



X/m=l U n ,s{fm) 



(Em=l /m) 



EM ,. 
m=l /" 



< sup 



(g^ H-U^n)) (Eg^ C lv ^ s (/ m )) 2 

ip 2 



cw 



< 



: log(M) 

where in the second inequality we used the relation 

M M M M 

frn) 2 } = ~E[ ^2 fmfm']< ^2 1 1 /™ 1 1 (n) 1 1 1 1 £-2 (n) = ( 1 1 1 1 i 2 (n) ) " 

. — . — 1 — — / 1 — — / 1 — 1 



m — 1 



m.m' — l 



m,m' — 1 



and in the third and forth inequality we used Eq. (1161) and Eq. (1151) respectively. Here we again use 
Eq. ([T5]) to obtain 



U = sup 

f m eu r , 



(Em=l fmY 



Em=l C^n,s(/m) 



10 



Therefore for t 4— y/nt, the above inequality implies the following inequality 



sup 



m=l J m L~im=\ J"' 



£3 (II) 



< K 



2dC s + Ci + cf\ s/nmax(l,Vi,t/y/ri). (19) 



with probability 1 - exp(-i). Remind <f/ a = K 2C X C S + C x + Cf 
Now we define 



, then we obtain the assertion. 



<p s := max (^KL 
We define events <oi(t) and as 



2C S + 1 + Ci 



2CiC s + d + C? 



AW = 



i " 

E^ifmi^i, 



< <PsU n , s (f m )r)(t), V/ m e H m (m = 1, . . . , M) 



Z^m=l J" 



Z^m=l Jrt 



L 2 (n) 



< faV^ |e u »M*jJ v{t'), 



Vf m eH m (m = l,...,M)|. 
The following theorem immediately gives Theorem 2.



Theorem 11 Let A fee an arbitrary positive real, and A^ satisfy X\ n> > ^A. TTien /or al/ n and 
t'(> 0) £/iai satisfy log (y) < 1 and ips^fnC^rj^') / km < 4, we Ziawe 



>(») ^ l 



ii/ -riii.cn) < d^ (t)2 ^ c " + |aw ( £ uai^ 



\m— 1 



to^/i probability 1 — exp(— i) — exp(— i') /or all t > 1. 

This gives Theorem [5] by setting V>s = 80 s . 

Proof: [Proof of Theorem [TT] Since = f*(xi) + ej, we have 

/ M \ I 

ll/-/lL(n)+Ai n) Ell/"II«J 



<(ii/-riii 2( n)-ii/ 



n M / M 

r iin) + - E E e *(/™(^) - + A i n) E n/™HL 

n— 1 m — 1 \m=l 



Here on the event ^(i 7 ), the above inequality gives 



M 



ll/-/li 2( n)+Ai n) Ell/™ll*. 



(M \ 2 n M / M 

E ^.-(/™ - /m) + - E E - /m(^)) + E WfrnWn, 

711—1 / n=l m—1 \ m — 1 



(20) 



Before we show the statements, we show two basic upper bounds of X)m=i U n ,s{fm) required in 
the proof. First note that 



Em=l H/m|U 2 (n) . A5 Ylm=\ \\fm\\u 



M 



M 



< 



'M 



E ii/- 



lL(n) 



a- E ii/- 
M / M 



<2 Ell/™HL(n) + A Ell/™H«J 
V m— 1 \m=l / 

Therefore we have 

E ^(/™) ^ 2 v — ^ v ^ v ' 



m— 1 



M / M 

" lm\\n K 



En/-nL(n)+A Ell/-"" 



.m— 1 \m— 1 / 

Reminding the definition of („ (Eq. (O), the above bound is equivalent to 

M ( M / M \ ! 



E UnAU <Cn E ii-ULro + A E ii-U^L ■ ( 21 ) 



m— 1 \ m — 1 \m— 1 



By Eq. (f2~Tj) . the first term on the RHS can be upper bounded as 



/ M \ " / M / M \ 

E ^(/™ - /m) < ^v^c^(i') E ii/- - /mii! 2( n) + a E ii/™ - /™n« m 

\m— 1 / \ m— 1 \m=l / 

<*.vs&ko f IL/ ' /1li2(n) + a f e ii/™ - n 



km 



By assumption, we have $\phi_s \sqrt{n}\, \zeta_n^2\, \eta(t') / \kappa_M \le \frac{1}{8}$. Hence the RHS of the above inequality is bounded
by

g^ii/-riiL(n)+A^Eii/™-/-ii^) j- ( 22 ) 

Step 2. On the event $\mathcal{E}_1(t)$, we have

n M M 

-EE ^(/™(^) - /«(*<)) < E v{t)4>sU n Mm - P m ) 

n— 1 m— 1 m— 1 

/ M / M 

< #)0 s c™ E II/- - /™HL(n) + a E II/* - /™H* 



< -?-•?(*) w + ^ I E n/« - /™HL(n) + a ( E II/™ - /™n«, 



m— 1 



M / M 



km 



\m— 1 



< — »/(t)V^ + 5 (II/- /*HL(n) + A ( E II/™ ~ f*m\\u m ) ] ■ (23) 



\m— 1 



Step 3.

Substituting the inequalities (22) and (23) into Eq. (20), we obtain



„ / M 

iiiz-riiLw+A^ Eli/ 



m— 1 
<-%(*) 2 ^c* + j ( E II/- - P + A i n) f E n/X.) " ■

KM * \m=l / \m=l / 

Now the second term of the RHS can be bounded as 

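$$\left(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\right)^{2/p} \le \left(\sum_{m=1}^M \left(\|\hat{f}_m\|_{\mathcal{H}_m} + \|f_m^*\|_{\mathcal{H}_m}\right)^p\right)^{2/p} \le 2\left(\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m}^p\right)^{2/p} + 2\left(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\right)^{2/p}.$$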
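The bound on the second term combines the triangle inequality for the $\ell_p$ norm with $(a+b)^2 \le 2a^2 + 2b^2$. The following sketch checks it numerically on random nonnegative vectors standing in for the RKHS norms; it assumes the regularizer is the squared $\ell_p$-mixed norm (exponent $2/p$), and the function names are illustrative only.

```python
import random

def lp_sq(values, p):
    """Squared l_p norm of a vector: (sum_m |v_m|^p)^(2/p)."""
    return sum(abs(v) ** p for v in values) ** (2.0 / p)

def check_bound(a, b, p, tol=1e-9):
    """Check (sum_m (a_m + b_m)^p)^(2/p) <= 2*lp_sq(a) + 2*lp_sq(b) for nonnegative a, b."""
    lhs = lp_sq([x + y for x, y in zip(a, b)], p)
    rhs = 2.0 * lp_sq(a, p) + 2.0 * lp_sq(b, p)
    return lhs <= rhs * (1.0 + tol) + tol

def run_trials(num_trials=1000, M=5, seed=0):
    """Draw random p in [1, 10] and random nonnegative vectors; return True if no violation is found."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        p = rng.uniform(1.0, 10.0)
        a = [rng.uniform(0.0, 3.0) for _ in range(M)]
        b = [rng.uniform(0.0, 3.0) for _ in range(M)]
        if not check_bound(a, b, p):
            return False
    return True
```

Here `a` plays the role of the norms of the estimated components and `b` the norms of the true components. A random search of this kind only fails to find a counterexample; the inequality itself follows from Minkowski's inequality.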

Therefore we have 

|ll/-rili 2( n) < ^v(tMC + 2aW (e IICIlO* ■ 



4- J " « M 

This gives the assertion 



\m— 1 



B Proof of Theorem 3 (minimax learning rate)

Proof: [Proof of Theorem 3] The proof utilizes the techniques developed by Raskutti et al. (2009,
2010), who applied the information-theoretic technique developed by Yang and Barron (1999) to
the MKL setting. The $\delta$-packing number $Q(\delta, \mathcal{H}, L_2(\Pi))$ of a function class $\mathcal{H}$ is the largest number
of functions $\{f_1, \dots, f_Q\} \subseteq \mathcal{H}$ such that $\|f_i - f_j\|_{L_2(\Pi)} > \delta$ for all $i \neq j$. To simplify the notation,
we write $\mathcal{F} := \mathcal{H}_{\ell_p}(R)$, $N(\epsilon, \mathcal{H}) := N(\epsilon, \mathcal{H}, L_2(\Pi))$ and $Q(\epsilon, \mathcal{H}) := Q(\epsilon, \mathcal{H}, L_2(\Pi))$. It can be easily
shown that $Q(2\epsilon, \mathcal{F}) \le N(\epsilon, \mathcal{F}) \le Q(\epsilon, \mathcal{F})$.
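The easier half of the relation between covering and packing numbers, $N(\delta) \le Q(\delta)$, holds because a packing that cannot be extended is automatically a cover. The sketch below illustrates this on a finite point cloud, which stands in for the function class, with Euclidean distance standing in for the $L_2(\Pi)$ norm; both substitutions and all function names are assumptions made for illustration.

```python
import math
import random

def dist(u, v):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def greedy_packing(points, delta):
    """Scan the points and keep each one that is farther than delta from all kept points."""
    centers = []
    for x in points:
        if all(dist(x, c) > delta for c in centers):
            centers.append(x)
    return centers

def is_packing(centers, delta):
    """True if all distinct pairs of centers are more than delta apart."""
    return all(dist(centers[i], centers[j]) > delta
               for i in range(len(centers))
               for j in range(i + 1, len(centers)))

def is_cover(centers, points, delta):
    """True if every point lies within distance delta of some center."""
    return all(any(dist(x, c) <= delta for c in centers) for x in points)

def demo(num_points=200, dim=2, delta=0.25, seed=0):
    """Build a random point cloud in the unit cube and a maximal delta-packing of it."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(num_points)]
    return points, greedy_packing(points, delta)
```

The greedy set is a $\delta$-packing by construction, and every rejected point was rejected because it lies within $\delta$ of a kept point, so the same set is a $\delta$-cover.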

We utilize the following inequality given by Lemma 3 of Raskutti et al. (2009):



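$$\min_{\hat{f}} \max_{f^* \in \mathcal{F}} \mathrm{E}\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \ge \frac{\delta_n^2}{4}\left(1 - \frac{\log N(\epsilon_n, \mathcal{F}) + n\epsilon_n^2/(2\sigma^2) + \log 2}{\log Q(\delta_n, \mathcal{F})}\right).$$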



First we show the assertion for $p = \infty$. In this situation, there is a constant $C$ that depends only
on $s$ such that



$$\log Q(\delta, \mathcal{F}) \ge C M \log Q(\delta/\sqrt{M}, \mathcal{H}(R)), \qquad \log N(\epsilon, \mathcal{F}) \le M \log N(\epsilon/\sqrt{M}, \mathcal{H}(R))$$

(this is shown in Lemma 5 of Raskutti et al. (2010), but we give the proof in Lemma 12 for
completeness). Using this expression, the minimax learning rate is bounded as

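$$\min_{\hat{f}} \max_{f^* \in \mathcal{F}} \mathrm{E}\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \ge \frac{\delta_n^2}{4}\left(1 - \frac{M \log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R)) + n\epsilon_n^2/(2\sigma^2) + \log 2}{C M \log Q(\delta_n/\sqrt{M}, \mathcal{H}(R))}\right).$$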



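Here we choose $\epsilon_n$ and $\delta_n$ to satisfy the following relations:

$$\frac{n\epsilon_n^2}{2\sigma^2} \le M \log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R)), \quad (24)$$
$$4 \log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R)) \le C \log Q(\delta_n/\sqrt{M}, \mathcal{H}(R)). \quad (25)$$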

With $\epsilon_n$ and $\delta_n$ that satisfy the above relations (24) and (25), we have

Tr^) E|l/ " rllL(n) -i- (26) 

The relation (24) can be rewritten as

< CM 



2o* " " Vi?A/M 
It is sufficient to impose 

with a constant $C$. The relation (25) can be satisfied by taking $\delta_n = c\epsilon_n$ with an appropriately
chosen constant $c$. Thus Eq. (26) gives



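$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_\infty}(R)} \mathrm{E}\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \ge C n^{-\frac{1}{1+s}} M R^{\frac{2s}{1+s}}, \quad (27)$$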
with a constant $C$. This gives the assertion for $p = \infty$.
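The choice of $\epsilon_n$ behind this rate can be made explicit. The computation below is a sketch under two assumptions that are not restated in this appendix: relation (24) is read as $n\epsilon_n^2/(2\sigma^2) \le M \log N(\epsilon_n/\sqrt{M}, \mathcal{H}(R))$, and the covering entropy of the RKHS ball is taken to be $\log N(\epsilon, \mathcal{H}(R)) = C'(R/\epsilon)^{2s}$. The constants $C'$ and $C''$ are placeholders introduced here.

```latex
\begin{align*}
\frac{n\epsilon_n^2}{2\sigma^2}
  &\le C' M \left(\frac{\sqrt{M}\,R}{\epsilon_n}\right)^{2s}
   = C' M^{1+s} R^{2s} \epsilon_n^{-2s}
  && \text{(relation (24) with the assumed entropy)} \\
\epsilon_n^{2(1+s)}
  &\le 2\sigma^2 C' \, \frac{M^{1+s} R^{2s}}{n}
  && \text{(multiply by } 2\sigma^2 \epsilon_n^{2s} / n \text{)} \\
\epsilon_n^{2}
  &= C'' \, n^{-\frac{1}{1+s}} \, M \, R^{\frac{2s}{1+s}}
  && \text{(largest admissible choice, } C'' = (2\sigma^2 C')^{\frac{1}{1+s}} \text{)}
\end{align*}
```

Since $\delta_n = c\epsilon_n$, the quantity $\delta_n^2$ has the same order in $n$, $M$ and $R$, which is the scaling $n^{-\frac{1}{1+s}} M R^{\frac{2s}{1+s}}$ of the lower bound.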

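Finally we show the assertion for $1 \le p < \infty$. Note that $\mathcal{H}_{\ell_\infty}(R/M^{1/p}) \subseteq \mathcal{H}_{\ell_p}(R)$ (this is because,
for $\{x_m\}_{m=1}^M$ s.t. $|x_m| \le R/M^{1/p}$ ($\forall m$), we have $\sum_{m=1}^M |x_m|^p \le M (R/M^{1/p})^p = R^p$). Therefore we
have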

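$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} \mathrm{E}\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \ge \min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_\infty}(R/M^{1/p})} \mathrm{E}\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \ge C n^{-\frac{1}{1+s}} M \left(R/M^{1/p}\right)^{\frac{2s}{1+s}} \quad (\because \text{Eq. (27)}).$$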



This concludes the proof. 
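The reduction from $1 \le p < \infty$ to $p = \infty$ rests on the inclusion of the $\ell_\infty$ ball of radius $R/M^{1/p}$ in the $\ell_p$ ball of radius $R$. The sketch below checks the inclusion on random vectors and also confirms that substituting $R/M^{1/p}$ into the $p = \infty$ rate collects to $n^{-\frac{1}{1+s}} M^{1 - \frac{2s}{p(1+s)}} R^{\frac{2s}{1+s}}$. The constant $C$ is dropped, real vectors stand in for the vectors of RKHS norms, and all names are illustrative.

```python
import random

def lp_norm(x, p):
    """l_p norm of a real vector."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def inclusion_holds(M, p, R, num_trials=1000, seed=0):
    """Sample from the l_inf ball of radius R / M^(1/p) and check that the l_p norm is at most R."""
    rng = random.Random(seed)
    r_inf = R / M ** (1.0 / p)
    for _ in range(num_trials):
        x = [rng.uniform(-r_inf, r_inf) for _ in range(M)]
        if lp_norm(x, p) > R * (1.0 + 1e-12):
            return False
    return True

def rate_via_linf(n, M, R, p, s):
    """The p = infinity rate evaluated at radius R / M^(1/p), without the constant C."""
    return n ** (-1.0 / (1 + s)) * M * (R / M ** (1.0 / p)) ** (2.0 * s / (1 + s))

def rate_closed_form(n, M, R, p, s):
    """The same quantity with the powers of M collected into one exponent."""
    return (n ** (-1.0 / (1 + s))
            * M ** (1.0 - 2.0 * s / (p * (1 + s)))
            * R ** (2.0 * s / (1 + s)))
```

The extreme point with every coordinate equal to $R/M^{1/p}$ has $\ell_p$ norm exactly $R$, so the radius $R/M^{1/p}$ cannot be enlarged without leaving the $\ell_p$ ball.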



Lemma 12 There is a constant $C$ such that

$$\log Q(\delta, \mathcal{H}_{\ell_\infty}(R)) \ge C M \log Q(\delta/\sqrt{M}, \mathcal{H}(R)),$$

for sufficiently small $\delta$.

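Proof: The proof is analogous to that of Lemma 5 in Raskutti et al. (2010). We describe the
outline of the proof. Let $N = Q(\sqrt{2}\delta/\sqrt{M}, \mathcal{H}(R))$ and $\{f_m^1, \dots, f_m^N\}$ be a $\sqrt{2}\delta/\sqrt{M}$-packing of
$\mathcal{H}_m(R)$. Then we can construct a function class $\mathcal{F}$ as

$$\mathcal{F} := \left\{ f^j = \sum_{m=1}^M f_m^{j_m} \;\middle|\; j = (j_1, \dots, j_M) \in \{1, \dots, N\}^M \right\}.$$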

We write $[N] := \{1, \dots, N\}$. For two functions $f^j, f^{j'} \in \mathcal{F}$, we have by the construction

$$\|f^j - f^{j'}\|_{L_2(\Pi)}^2 = \sum_{m=1}^M \|f_m^{j_m} - f_m^{j'_m}\|_{L_2(\Pi)}^2 \ge \frac{2\delta^2}{M} \sum_{m=1}^M 1[j_m \neq j'_m].$$

Thus, it suffices to construct a sufficiently large subset $A \subseteq [N]^M$ such that all different pairs
$j, j' \in A$ have Hamming distance $d_H(j, j') := \sum_{m=1}^M 1[j_m \neq j'_m]$ of at least $M/2$.
Now we define $d_H(A, j) := \min_{j' \in A} d_H(j', j)$. If $|A|$ satisfies

$$\left|\left\{ j \in [N]^M \;\middle|\; d_H(A, j) \le \frac{M}{2} \right\}\right| < \left|[N]^M\right| = N^M, \quad (28)$$



then there exists a member $j' \in [N]^M$ such that $j'$ is more than $\frac{M}{2}$ away from $A$ with respect to
$d_H$, i.e. $d_H(A, j') > \frac{M}{2}$. That is, we can add $j'$ to $A$ as long as Eq. (28) holds. Now since

$$\left|\left\{ j \in [N]^M \;\middle|\; d_H(A, j) \le \frac{M}{2} \right\}\right| \le |A| \binom{M}{M/2} N^{M/2}, \quad (29)$$



Eq. (28) holds as long as $A$ satisfies

$$|A| \le \frac{1}{2} \frac{N^M}{\binom{M}{M/2} N^{M/2}} =: Q^*.$$



The logarithm of $Q^*$ can be evaluated as follows:

$$\log Q^* = \log\left(\frac{1}{2} \frac{N^M}{\binom{M}{M/2} N^{M/2}}\right) = M \log N - \log 2 - \log \binom{M}{M/2} - \frac{M}{2} \log N \ge \frac{M}{2} \log N - \log 2 - \log 2^M \ge \frac{M}{2} \log \frac{N}{16}.$$



There exists a constant C such that N = Q(V25/VM,H(R)) > CQ{8 j '\[M \H{R)) because 
Q(S,'H(R)) ~ (-^) . Thus we obtain the assertion for sufficiently large AT. ■ 
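The combinatorial step of the proof, growing a set $A \subseteq [N]^M$ whose members are pairwise more than $M/2$ apart in Hamming distance, can be illustrated by brute force for small $N$ and even $M$. The sketch below scans $[N]^M$ greedily and then checks the counting bound behind Eq. (29); the function names and the scan order are choices made here, not part of the original argument.

```python
import itertools
import math

def hamming(j, k):
    """Number of coordinates in which the index vectors j and k differ."""
    return sum(1 for a, b in zip(j, k) if a != b)

def greedy_separated_set(N, M):
    """Scan [N]^M and keep each index vector that is more than M/2 away from all kept ones."""
    A = []
    for j in itertools.product(range(N), repeat=M):
        if all(hamming(j, k) > M / 2 for k in A):
            A.append(j)
    return A

def counting_bound_holds(A, N, M):
    """Check |A| * C(M, M/2) * N^(M/2) >= N^M, which must hold once A cannot be extended (M even)."""
    return len(A) * math.comb(M, M // 2) * N ** (M // 2) >= N ** M
```

For example, `greedy_separated_set(4, 4)` returns index vectors that pairwise differ in at least three coordinates. Because the scan rejects a vector only when it lies within distance $M/2$ of a kept one, the final set is maximal and therefore satisfies the counting bound.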